Exploratory Analysis

We’ve imported the metadata for each congressional member (House, party, etc.), the trading disclosures downloaded from the QuiverQuant API, and the stock data pulled using tidyquant. I added two columns: report_lag, the number of days between the transaction and its disclosure, and overdue, the number of days that the lag runs past the 45-day limit. We’ll take a glimpse at a few random rows from the dataset below:
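As a sketch, the two derived columns can be computed directly from the date fields (base R; the two example rows here match the table below):

```r
# two example disclosures (report_date and trans_date as Date objects)
disclosures <- data.frame(
        report_date = as.Date(c("2020-05-08", "2018-07-06")),
        trans_date  = as.Date(c("2020-04-14", "2017-07-03"))
)

# days between the transaction and its disclosure
disclosures$report_lag <- as.numeric(disclosures$report_date - disclosures$trans_date)
# days past the 45-day STOCK Act deadline (0 if reported on time)
disclosures$overdue <- pmax(disclosures$report_lag - 45, 0)

disclosures$report_lag  # 24 368
disclosures$overdue     # 0 323
```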

| report_date | trans_date | Ticker | Representative | Transaction | Amount | House | Range | report_lag | overdue | DNR | first_name | last_name | party |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-05-08 | 2020-04-14 | CTSH | David Perdue | Sale | 15001 | Senate | $15,001 - $50,000 | 24 | 0 | 0 | David | Perdue | R |
| 2019-12-14 | 2019-11-26 | DAL | Donald Beyer | Purchase | 1001 | Representatives | $1,001-$15,000 | 18 | 0 | 0 | Donald | Beyer | D |
| 2018-07-06 | 2017-07-03 | OKE | David Perdue | Sale | 15001 | Senate | $15,001 - $50,000 | 368 | 323 | 0 | David | Perdue | R |
| 2018-05-23 | 2018-03-28 | GE | Carlos Curbelo | Purchase | 1001 | Representatives | $1,001-$15,000 | 56 | 11 | 0 | Carlos | Curbelo | R |
| 2020-07-24 | 2020-06-29 | BAM | James Inhofe | Sale | 1001 | Senate | $1,001 - $15,000 | 25 | 0 | 0 | James | Inhofe | R |
| 2017-12-06 | 2017-09-08 | PKI | David Perdue | Sale | 1001 | Senate | $1,001 - $15,000 | 89 | 44 | 0 | David | Perdue | R |
| 2017-03-29 | 2017-03-07 | GM | James Inhofe | Purchase | 50001 | Senate | $50,001 - $100,000 | 22 | 0 | 0 | James | Inhofe | R |
| 2018-11-19 | 2018-10-01 | INTC | Katherine Clark | Purchase | 1001 | Representatives | $1,001-$15,000 | 49 | 4 | 0 | Katherine | Clark | D |
| 2018-09-29 | 2018-07-12 | MET | Gary Palmer | Purchase | 1001 | Representatives | $1,001-$15,000 | 79 | 34 | 0 | Gary | Palmer | R |
| 2020-07-15 | 2020-07-01 | RUN | Alan Lowenthal | Sale | 40 | Representatives | $40.00 | 14 | 0 | 0 | Alan | Lowenthal | D |

Identifying the Worst Offenders of the STOCK Act

Disclosing a stock transaction after the imposed 45-day requirement should be a red flag, as we might find that those engaged in ‘suspicious’ trading would be unwilling to report a transaction right away. Then again, we may find that someone who has acted on insider information would be more willing to abide by the rules so that their behavior wouldn’t seem suspicious at all. In any event, let’s look at the average reporting lags for each of the congressional members.
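A sketch of the grouped summary behind those averages (dplyr; `trades` here is a toy stand-in for the full disclosure data frame):

```r
library(dplyr)

# toy stand-in for the full trades data frame
trades <- tibble::tibble(
        Representative = c("Member A", "Member A", "Member B"),
        report_lag     = c(10, 90, 30)
)

member_lags <- trades %>%
        group_by(Representative) %>%
        summarize(mean_lag    = mean(report_lag),   # average disclosure delay, in days
                  trans_count = n()) %>%            # how often the member trades
        arrange(desc(mean_lag))
```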

There are not only quite a few congressional members who are consistently delinquent in reporting, but also many who are very delinquent. We might conjecture that there is a relationship between the delinquency of a member and how frequently they trade.

In terms of political parties, the Democrats take almost 10 days longer on average to report their trades. Both parties average almost double the mandated 45-day STOCK Act deadline.

However, the conjectured relationship between delinquency and trading frequency doesn’t seem to hold. A log transformation on both of the variables doesn’t seem to help either.

Stocks with the Highest Reporting Lags

We could also look at the average reporting lag for individual stocks to see if there are some ticker transactions which people are more likely to report later than others. First, we can sort the tickers in terms of descending mean reporting lag and plot all of the values. I’ve removed the Ticker labels since they are difficult to display with 800 different tick marks, but the plot is interactive, so zoom in on a section and hover over the points to see the symbol.

Interesting for sure. Let’s filter further to see the stocks that, on average, were subject to delinquent reporting by congressional members. Let’s set a fairly high threshold, 250 days after transaction (DAT), to cut down on the observations. The red dotted line represents the 45 DAT reporting deadline.

If you hover over the points you can see the average lag associated with each one; however, the transaction count for most of these stocks is also very low. Let’s cross-check the top 100 highly delinquent stocks to see whether any of these are also among the 100 most popular across congressional members.

trade_delin_intersection <- ticker_lags %>% 
        top_n(100, trans_count) %>%                    # 100 most-traded tickers
        inner_join(ticker_lags %>% 
                           top_n(100, mean_lag),       # 100 most-delinquent tickers
                   by = c('Ticker', 'mean_lag', 'sd_lag', 'trans_count'))
## Ticker mean_lag sd_lag trans_count
## 1 NA NA NA NA

Alas, it seems like there are no overlapping tickers for these subsets of the data.

Take a look at the graphs below. In general the higher the average delinquency of the stock, the lower the transaction count, although the relationship is not linear. We can try to model this later.

The density plot below shows the log scaled reporting lag for both houses and parties. The vertical red lines represent the (log) 45 day reporting deadline. The vast majority of transactions get reported on time, but it seems that there is some disparity between parties at the extreme end, at least for the Senate.

Let’s take a look at the total transaction amounts for each of the tickers in our database that have a total transaction count greater than 50.

Trade Data


Sparse Matrix Construction and Scaling

We can try to identify some preliminary clusters of trading members by the securities that they have invested in over the timeframe we have data for. Let’s start with a simple unweighted matrix with a 1 for each security a member has held and a 0 for each security they have not held. Since our values are binary and we don’t want to destroy the sparsity of the matrix, we will not center or scale the data at all.

trading_members <- unique(trades$Representative)

# build triplet vectors (i = member row, j = ticker column, x = value)
i <- c()
j <- c()
x <- c()

for(ind in 1:length(trading_members)){
        df <- filter(trades, Representative == trading_members[ind])
        unique_tickers <- unique(df$Ticker)
        # tickers is the vector of all unique tickers in the data
        j_ind <- which(tickers %in% unique_tickers)
        i_ind <- rep(ind, times = length(j_ind))
        values <- rep(1, times = length(j_ind))   # binary holding indicator
        i <- c(i, i_ind)
        j <- c(j, j_ind)
        x <- c(x, values)
}

A1 <- sparseMatrix(i, j, x = x)

Our matrix has 167 rows and 818 columns which corresponds to the number of trading members and unique tickers, respectively. Let’s check out the structure of A1.

str(A1)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3334] 6 7 8 9 10 11 12 13 14 15 ...
##   ..@ p       : int [1:819] 0 12 14 88 90 92 93 94 96 97 ...
##   ..@ Dim     : int [1:2] 167 818
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:3334] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()

We can create another sparse matrix with the number of transactions a member has made in each stock, expressed as a proportion of that member’s total transaction count. The nonzero elements in each column will be extracted, centered, and scaled so that the mean of each column is near zero while still maintaining sparsity.

member_totals <- trades %>% 
        group_by(Representative) %>% 
        summarise(total_count = n())

member_trades <- trades %>% 
        group_by(Representative, Ticker) %>% 
        summarize(n = n()) %>% 
        left_join(member_totals, by = "Representative") %>% 
        rowwise() %>% 
        mutate(pr_trans = n/total_count)
## `summarise()` has grouped output by 'Representative'. You can override using
## the `.groups` argument.
i<- c()
j <- c()
x <- c()

for(ind in 1:length(trading_members)){
        df <- filter(member_trades, Representative == trading_members[ind])
        # match() keeps the column indices aligned with df's row order,
        # so each proportion lands on the correct ticker
        j_ind <- match(df$Ticker, trade_tickers)
        i_ind <- rep(ind, times = length(j_ind))
        values <- df$pr_trans
        i <- c(i, i_ind)
        j <- c(j, j_ind)
        x <- c(x, values)
}

A2 <- sparseMatrix(i, j, x = x)

for(i in 1:ncol(A2)){
        vec <- A2[, i]
        ind <- which(vec != 0)
        subset <- vec[ind]
        if(length(subset) > 1 && sd(subset) > 0){
                # center and scale only the nonzero entries, preserving sparsity
                #A2[ind, i] <- (subset - min(subset))/(max(subset) - min(subset))*2 - 1
                A2[ind, i] <- scale(subset)
        } else if(length(subset) > 1 && sd(subset) == 0){
                A2[ind, i] <- 1   # constant column: collapse to 1
        } else if(length(subset) == 1){
                A2[ind, i] <- 1   # single nonzero entry: flag with 1
        }
}

Then we can create a sparse matrix with each stock’s proportion of the total transaction amount (purchases and sales) in each member’s portfolio. The nonzero elements in each column will be scaled as described above.

member_totals <- trades %>% 
        group_by(Representative) %>% 
        summarize(total_amt = sum(Amount))


member_ticker_weights <- trades %>% 
        group_by(Representative, Ticker) %>% 
        summarize(ticker_amt = sum(Amount)) %>% 
        full_join(member_totals, by = 'Representative') %>% 
        rowwise() %>% 
        mutate(pr_weight = ticker_amt/total_amt)
## `summarise()` has grouped output by 'Representative'. You can override using
## the `.groups` argument.
i<- c()
j <- c()
x <- c()

for(ind in 1:length(trading_members)){
  df <- filter(member_ticker_weights, Representative == trading_members[ind])
  # match() keeps the column indices aligned with df's row order
  j_ind <- match(df$Ticker, trade_tickers)
  i_ind <- rep(ind, times = length(j_ind))
  values <- df$pr_weight
  i <- c(i, i_ind)
  j <- c(j, j_ind)
  x <- c(x, values)
}

A3 <- sparseMatrix(i, j, x = x)

for(i in 1:ncol(A3)){
        vec <- A3[, i]
        ind <- which(vec != 0)
        subset <- vec[ind]
        if(length(subset) > 1 && sd(subset) > 0){
                # center and scale only the nonzero entries, preserving sparsity
                #A3[ind, i] <- (subset - min(subset))/(max(subset) - min(subset))*2 - 1
                A3[ind, i] <- scale(subset)
        }
}

Next, we’ll create sparse representations of the data from a date perspective. For each date that a member made a transaction, we will include a proportional value of their total transactions on that day relative to the total value of all transactions that member made on all days they had transactions. Columns will be centered and scaled.

date_set <- unique(trades$trans_date)


i<- c()
j <- c()
x <- c()

for(ind in 1:length(trading_members)){
        df <- filter(trades, Representative == trading_members[ind]) %>% 
                group_by(trans_date) %>% 
                summarize(total_amount = sum(Amount),
                          count = n(),
                          avg_trans = total_amount/count)
        total <- sum(df$total_amount)
        df$prop <- df$total_amount/total
        # match() keeps the column indices aligned with df's row order
        j_ind <- match(df$trans_date, date_set)
        i_ind <- rep(ind, times = length(j_ind))
        values <- df$prop
        i <- c(i, i_ind)
        j <- c(j, j_ind)
        x <- c(x, values)
}

A4 <- sparseMatrix(i, j, x = x)

for(i in 1:ncol(A4)){
        vec <- A4[, i]
        ind <- which(vec != 0)
        subset <- vec[ind]
        if(length(subset) > 1 && sd(subset) > 0){
                A4[ind, i] <- scale(subset)   # center and scale nonzero entries
        } else {
                A4[ind, i] <- 1               # constant or singleton column
        }
}

Matrix A5, shown below, records the number of transactions each member made on each day as a proportion of that member’s total transaction count over the data period. Columns will be centered and scaled.

date_set <- unique(trades$trans_date)


i<- c()
j <- c()
x <- c()

for(ind in 1:length(trading_members)){
        df <- filter(trades, Representative == trading_members[ind]) %>% 
                group_by(trans_date) %>% 
                count()
        total_trades <- sum(df$n)
        df$prop_trades <- df$n/total_trades
        # match() keeps the column indices aligned with df's row order
        j_ind <- match(df$trans_date, date_set)
        i_ind <- rep(ind, times = length(j_ind))
        values <- df$prop_trades
        i <- c(i, i_ind)
        j <- c(j, j_ind)
        x <- c(x, values)
}

A5 <- sparseMatrix(i, j, x = x)

for(i in 1:ncol(A5)){
        vec <- A5[, i]
        ind <- which(vec != 0)
        subset <- vec[ind]
        if(length(subset) > 1 && sd(subset) > 0){
                A5[ind, i] <- scale(subset)   # center and scale nonzero entries
        } else {
                A5[ind, i] <- 1               # constant or singleton column
        }
}

Finally, our last matrix records the total dollar amount transacted by each member on every day that they made a transaction. The nonzero elements are then centered and scaled as before.

i<- c()
j <- c()
x <- c()

for(ind in 1:length(trading_members)){
        df <- filter(trades, Representative == trading_members[ind]) %>% 
                group_by(trans_date) %>% 
                summarize(total_amount = sum(Amount))
        # match() keeps the column indices aligned with df's row order
        j_ind <- match(df$trans_date, date_set)
        i_ind <- rep(ind, times = length(j_ind))
        values <- df$total_amount
        i <- c(i, i_ind)
        j <- c(j, j_ind)
        x <- c(x, values)
}

A6 <- sparseMatrix(i, j, x = x)

for(i in 1:ncol(A6)){
        vec <- A6[, i]
        ind <- which(vec != 0)
        subset <- vec[ind]
        if(length(subset) > 1 && sd(subset) > 0){
                A6[ind, i] <- scale(subset)   # center and scale nonzero entries
        } else {
                A6[ind, i] <- 1               # constant or singleton column
        }
}

Dimensionality Reduction with SparseSVD

When working with sparse matrices, there isn’t always a clear-cut way to reduce the dimensionality. We tried several of the algorithms from the sparsepca package, but found that they were a bit slow and the results were not always meaningful for our datasets. Instead, we used the sparsesvd function from the package of the same name. We calculated the first twenty principal components and ran the DBSCAN algorithm on those components alone. For DBSCAN, we set the minPts argument to n + 1 and used a kNN distance plot to pick a decent value for \(\epsilon\).
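The kNN distance curve behind that choice can be sketched in base R: for each point, take the distance to its k-th nearest neighbour, sort the values, and pick \(\epsilon\) near the elbow of the curve. The toy point cloud here is only for illustration.

```r
# sorted k-th nearest-neighbour distances: the curve behind a kNN distplot
knn_dist <- function(x, k){
        d <- as.matrix(dist(x))   # pairwise Euclidean distances
        diag(d) <- Inf            # exclude each point's distance to itself
        kth <- apply(d, 1, function(row) sort(row)[k])
        sort(kth)
}

set.seed(1)
pts <- matrix(rnorm(40), ncol = 2)   # toy 2-D point cloud, 20 points
eps_curve <- knn_dist(pts, k = 3)
# plot(eps_curve, type = "l")        # choose eps near the knee of this curve
```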

A1knplot <- ggplot(as.data.frame(as.matrix(A1spc)), aes(x = V1, y = V2, color = factor(A1km_naive$cluster),text = trading_members))+
        geom_point(alpha=.8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'Kmeans\nCluster',
             title = 'A1 - Portfolio sPCA with K-means Clustering')

ggplotly(A1knplot)
A1kplot <- ggplot(as.data.frame(as.matrix(A1spc)), aes(x = V1, y = V2, color = factor(A1km$cluster),text = trading_members))+
        geom_point(alpha=.8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'sK-means\nCluster',
             title = 'A1 - Portfolio sPCA with Spherical K-means Clustering')

ggplotly(A1kplot)
A1plot <- ggplot(as.data.frame(as.matrix(A1spc)), aes(x = V1, y = V2, color = factor(A1clust$cluster),text = trading_members))+
        geom_point(alpha=.8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'DBSCAN\nCluster',
             title = 'A1 - Portfolio sPCA')
        
ggplotly(A1plot)

A2kplot <- ggplot(as.data.frame(as.matrix(A2spc)), aes(x = V1, y = V2, color = factor(A2km$cluster), text = trading_members))+
        geom_point(size = 2, alpha = .6)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'kMeans\nCluster',
             title = 'A2 - Ticker Transaction Count Proportion sPCA')

ggplotly(A2kplot)
A2plot <- ggplot(as.data.frame(as.matrix(A2spc)), aes(x = V1, y = V2, color = factor(A2clust$cluster), text = trading_members))+
        geom_point(size = 2, alpha = .6)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'DBSCAN\nCluster',
             title = 'A2 - Ticker Transaction Count Proportion sPCA')

ggplotly(A2plot)

A3kplot <- ggplot(as.data.frame(as.matrix(A3spc)), aes(x = V1, y = V2, color = factor(A3km$cluster), text = trading_members))+
        geom_point(alpha = .8)+
        scale_color_viridis_d(option ='D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'kMeans\nCluster',
             title = 'A3 - Ticker Transaction Amount Proportion sPCA')

ggplotly(A3kplot)
A3plot <- ggplot(as.data.frame(as.matrix(A3spc)), aes(x = V1, y = V2, color = factor(A3clust$cluster), text = trading_members))+
        geom_point(alpha = .8)+
        scale_color_viridis_d(option ='D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'DBSCAN\nCluster',
             title = 'A3 - Ticker Transaction Amount Proportion sPCA')

ggplotly(A3plot)

A4kplot <- ggplot(as.data.frame(as.matrix(A4spc)), aes(x = V1, y = V2, color = factor(A4km$cluster), text = trading_members))+
        geom_point(alpha = .8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'kMeans\nCluster',
             title = 'A4 - Daily Transactional Value Proportion sPCA')
ggplotly(A4kplot)
A4plot <- ggplot(as.data.frame(as.matrix(A4spc)), aes(x = V1, y = V2, color = factor(A4clust$cluster), text = trading_members))+
        geom_point(alpha = .8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'DBSCAN\nCluster',
             title = 'A4 - Daily Transactional Value Proportion sPCA')
ggplotly(A4plot)

A5kplot <- ggplot(as.data.frame(as.matrix(A5spc)), aes(x = V1, y = V2, color = factor(A5km$cluster),text = trading_members))+
        geom_point(alpha = .8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'kMeans\nCluster',
             title = 'A5 - Daily Transactional Count Proportion sPCA')

ggplotly(A5kplot)
A5plot <- ggplot(as.data.frame(as.matrix(A5spc)), aes(x = V1, y = V2, color = factor(A5clust$cluster),text = trading_members))+
        geom_point(alpha = .8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'DBSCAN\nCluster',
             title = 'A5 - Daily Transactional Count Proportion sPCA')

ggplotly(A5plot)

A6kplot <- ggplot(as.data.frame(as.matrix(A6spc)), aes(x = V1, y = V2, color = factor(A6km$cluster), text = trading_members))+
        geom_point(alpha = .8)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'kMeans\nCluster',
             title = 'A6 - Daily Transaction Amount sPCA')

ggplotly(A6kplot)
A6plot <- ggplot(as.data.frame(as.matrix(A6spc)), aes(x = V1, y = V2, color = factor(A6clust$cluster), text = trading_members))+
        geom_point(alpha = .6, size = 2)+
        scale_color_viridis_d(option = 'D', begin = .4, end = .8)+
        labs(x = 'sPC1',
             y = 'sPC2',
             color = 'DBSCAN\nCluster',
             title = 'A6 - Daily Transaction Amount sPCA')

ggplotly(A6plot)
cosinesim_calc <- function(sparse_matrix, clusters, id){
        dim1 <- dim(sparse_matrix)[1]
        # cosSparse() returns cosine similarity between columns, so transpose
        # first to get a member-by-member similarity matrix
        data_df <- data.frame(matrix(cosSparse(t(sparse_matrix)), nrow = dim1, ncol = dim1))
        colnames(data_df) <- id
        rownames(data_df) <- id
        member_df <- data.frame(id = id, cluster = clusters)
        
        res <- list(cos_sim = data_df, members = member_df)
        # save the result to a file named after the input matrix
        dput(res, file = paste('output/', deparse(substitute(sparse_matrix)), '_clustering.R', sep = ''))
}
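For reference, the quantity computed above is plain cosine similarity between the members’ holding vectors; a base-R check on two hypothetical vectors:

```r
# cosine similarity: dot product over the product of vector norms
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

u <- c(1, 0, 1, 0)   # member holds tickers 1 and 3
v <- c(0, 1, 0, 1)   # member holds tickers 2 and 4

cosine_sim(u, u)   # identical portfolios -> 1
cosine_sim(u, v)   # disjoint portfolios  -> 0
```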

cosinesim_calc(A1, A1clust$cluster, id = trading_members)
cosinesim_calc(A2, A2clust$cluster, id = trading_members)
cosinesim_calc(A3, A3clust$cluster, id = trading_members)
cosinesim_calc(A4, A4clust$cluster, id = trading_members)
cosinesim_calc(A5, A5clust$cluster, id = trading_members)
cosinesim_calc(A6, A6clust$cluster, id = trading_members)
A1_clustering <- dget('../analysis/output/A1_clustering.R')
A2_clustering <- dget('../analysis/output/A2_clustering.R')
A3_clustering <- dget('../analysis/output/A3_clustering.R')
A4_clustering <- dget('../analysis/output/A4_clustering.R')
A5_clustering <- dget('../analysis/output/A5_clustering.R')
A6_clustering <- dget('../analysis/output/A6_clustering.R')
A1_outliers <- A1_clustering$members %>% 
        filter(cluster == 0)
A2_outliers <- A2_clustering$members %>% 
        filter(cluster == 0)
A3_outliers <- A3_clustering$members %>% 
        filter(cluster == 0)
A4_outliers <- A4_clustering$members %>% 
        filter(cluster == 0)
A5_outliers <- A5_clustering$members %>% 
        filter(cluster == 0)
A6_outliers <- A6_clustering$members %>% 
        filter(cluster == 0)
outlier_members <- unique(c(A1_outliers$id, A2_outliers$id, A3_outliers$id, A4_outliers$id, A5_outliers$id, A6_outliers$id))
outliers_df <- data.frame(members = outlier_members,
                          A1 = ifelse(outlier_members %in% A1_outliers$id, 1, 0), 
                          A2 = ifelse(outlier_members %in% A2_outliers$id, 1, 0),
                          A3 = ifelse(outlier_members %in% A3_outliers$id, 1, 0),
                          A4 = ifelse(outlier_members %in% A4_outliers$id , 1, 0),
                          A5 = ifelse(outlier_members %in% A5_outliers$id, 1, 0),
                          A6 = ifelse(outlier_members %in% A6_outliers$id, 1, 0)) %>% 
        rowwise() %>% 
        mutate(count = sum(c_across(A1:A6), na.rm = T))
out_plot <- ggplot(outliers_df, aes(x = reorder(members, -count), y = count, fill = factor(count)))+
        geom_bar(stat = 'identity')+
        scale_fill_viridis_d(option = 'D', begin = .2, end = .8)+
        theme(axis.text.x = element_text(angle = 90, vjust=.5))+
        labs(title = 'Outlier Appearance Count from DBSCAN Clustering',
             x = 'Congressional Member',
             y = 'Outlier Flag Count',
             fill = 'Flag Count')

ggplotly(out_plot)
apply(outliers_df[,2:7], 2, sum)
## A1 A2 A3 A4 A5 A6 
## 13  8  7 12 17 23
dput(outliers_df, 'outliers_df.R')
mat_list <- list(A1, A2, A3, A4, A5, A6)

res_mat <- matrix(nrow = 167, ncol = 6)

for(i in 1:length(mat_list)){
        i_ds <- vegdist(mat_list[[i]], 'euc')
        i_mean <- apply(as.matrix(i_ds), 2, mean)
        res_mat[,i] <- i_mean
}
colnames(res_mat) <- c('A1', 'A2', 'A3', 'A4', 'A5', 'A6')
res_df <- data.frame(res_mat) %>% 
        mutate(member = trading_members) %>% 
        rowwise() %>% 
        mutate(mean_ds = sum(A1, A2, A3, A4, A5, A6)/6) %>% 
        select(member, everything())


ordered_dis <- res_df[order(-res_df$mean_ds),]
dput(ordered_dis, 'dissim.R')
dnr_mem <- trades %>% 
        filter(DNR == 1) %>% 
        group_by(Representative) %>% 
        summarize(transaction_count = n(),
                  transaction_amount = sum(Amount))

dput(dnr_mem, 'dnr_mem.R')

del_mem <- trades %>% 
        filter(report_lag > 45) %>% 
        group_by(Representative) %>% 
        summarize(transaction_count = n(),
                  transaction_amount = sum(Amount),
                  lag = mean(report_lag)) %>% 
        arrange(-lag)

dput(del_mem, 'del_mem.R')
purchase_outliers <- dget('purchase_outliers.R')
sale_outliers <- dget('sale_outliers.R')

pout_count <- purchase_outliers %>% 
        group_by(Representative) %>% 
        count() %>% 
        mutate(purchases = 1) %>% 
        select(-n)

sout_count <- sale_outliers %>% 
        group_by(Representative) %>% 
        count() %>% 
        mutate(sales = 1) %>% 
        select(-n)

total_outliers <- outliers_df %>% 
        full_join(pout_count, by = c('members' = 'Representative')) %>% 
        full_join(sout_count, by = c('members' = 'Representative')) %>% 
        replace_na(list(A1 = 0, A2 = 0, A3 = 0, A4 = 0, A5 = 0, A6 = 0, purchases = 0, sales = 0)) %>% 
        select(-count)


#create an edge list of members flagged together
res <- NULL

for(i in 2:ncol(total_outliers)){
        sub <- total_outliers$members[total_outliers[,i] == 1]
        grid <- tidyr::crossing(sub, sub)            # all ordered member pairs
        grid <- grid[grid$sub...1 != grid$sub...2,]  # drop self-pairs
        print(grid)
        unq <- unique(grid$sub...1)
        # the first n - 1 rows are enough to connect all flagged members
        edges <- grid[1:(length(unq) - 1),]
        res <- rbind(res, edges)
}
## # A tibble: 156 x 2
##    sub...1      sub...2         
##    <chr>        <chr>           
##  1 David Perdue Dean Phillips   
##  2 David Perdue Donna Shalala   
##  3 David Perdue Gilbert Cisneros
##  4 David Perdue Greg Gianforte  
##  5 David Perdue Josh Gottheimer 
##  6 David Perdue Lois Frankel    
##  7 David Perdue Mikie Sherrill  
##  8 David Perdue Peter Meijer    
##  9 David Perdue Susie Lee       
## 10 David Perdue Thomas Carper   
## # ... with 146 more rows
## # A tibble: 56 x 2
##    sub...1      sub...2         
##    <chr>        <chr>           
##  1 Bryan Steil  David Perdue    
##  2 Bryan Steil  Dean Phillips   
##  3 Bryan Steil  Gilbert Cisneros
##  4 Bryan Steil  Greg Gianforte  
##  5 Bryan Steil  Josh Gottheimer 
##  6 Bryan Steil  Lloyd Doggett   
##  7 Bryan Steil  Susie Lee       
##  8 David Perdue Bryan Steil     
##  9 David Perdue Dean Phillips   
## 10 David Perdue Gilbert Cisneros
## # ... with 46 more rows
## # A tibble: 42 x 2
##    sub...1          sub...2         
##    <chr>            <chr>           
##  1 Dean Phillips    Gilbert Cisneros
##  2 Dean Phillips    Greg Gianforte  
##  3 Dean Phillips    Josh Gottheimer 
##  4 Dean Phillips    Rick Allen      
##  5 Dean Phillips    Susie Lee       
##  6 Dean Phillips    Van Taylor      
##  7 Gilbert Cisneros Dean Phillips   
##  8 Gilbert Cisneros Greg Gianforte  
##  9 Gilbert Cisneros Josh Gottheimer 
## 10 Gilbert Cisneros Rick Allen      
## # ... with 32 more rows
## # A tibble: 132 x 2
##    sub...1        sub...2           
##    <chr>          <chr>             
##  1 Alan Lowenthal David Perdue      
##  2 Alan Lowenthal Gilbert Cisneros  
##  3 Alan Lowenthal Greg Gianforte    
##  4 Alan Lowenthal Josh Gottheimer   
##  5 Alan Lowenthal Lois Frankel      
##  6 Alan Lowenthal Pat Roberts       
##  7 Alan Lowenthal Sheldon Whitehouse
##  8 Alan Lowenthal Susie Lee         
##  9 Alan Lowenthal Thomas Suozzi     
## 10 Alan Lowenthal Tom Malinowski    
## # ... with 122 more rows
## # A tibble: 272 x 2
##    sub...1        sub...2         
##    <chr>          <chr>           
##  1 Alan Lowenthal David Perdue    
##  2 Alan Lowenthal Dean Phillips   
##  3 Alan Lowenthal Donald Beyer    
##  4 Alan Lowenthal Gilbert Cisneros
##  5 Alan Lowenthal Greg Gianforte  
##  6 Alan Lowenthal Josh Gottheimer 
##  7 Alan Lowenthal Kevin Hern      
##  8 Alan Lowenthal Lois Frankel    
##  9 Alan Lowenthal Mike Garcia     
## 10 Alan Lowenthal Mo Brooks       
## # ... with 262 more rows
## # A tibble: 506 x 2
##    sub...1        sub...2         
##    <chr>          <chr>           
##  1 Alan Lowenthal David Perdue    
##  2 Alan Lowenthal Dean Phillips   
##  3 Alan Lowenthal Donald Beyer    
##  4 Alan Lowenthal Gilbert Cisneros
##  5 Alan Lowenthal Greg Gianforte  
##  6 Alan Lowenthal James Inhofe    
##  7 Alan Lowenthal Jim Langevin    
##  8 Alan Lowenthal John Hoeven     
##  9 Alan Lowenthal Josh Gottheimer 
## 10 Alan Lowenthal Kelly Loeffler  
## # ... with 496 more rows
## # A tibble: 506 x 2
##    sub...1        sub...2         
##    <chr>          <chr>           
##  1 Adam Kinzinger Austin Scott    
##  2 Adam Kinzinger Brian Mast      
##  3 Adam Kinzinger David Rouzer    
##  4 Adam Kinzinger Donald Beyer    
##  5 Adam Kinzinger Donna Shalala   
##  6 Adam Kinzinger Ed Case         
##  7 Adam Kinzinger Gilbert Cisneros
##  8 Adam Kinzinger Greg Gianforte  
##  9 Adam Kinzinger Josh Gottheimer 
## 10 Adam Kinzinger K. Conaway      
## # ... with 496 more rows
## # A tibble: 272 x 2
##    sub...1        sub...2         
##    <chr>          <chr>           
##  1 Adam Kinzinger Alan Lowenthal  
##  2 Adam Kinzinger Brian Mast      
##  3 Adam Kinzinger Cheri Bustos    
##  4 Adam Kinzinger David Perdue    
##  5 Adam Kinzinger Dean Phillips   
##  6 Adam Kinzinger Donald Beyer    
##  7 Adam Kinzinger Gilbert Cisneros
##  8 Adam Kinzinger Greg Gianforte  
##  9 Adam Kinzinger Josh Gottheimer 
## 10 Adam Kinzinger Kelly Loeffler  
## # ... with 262 more rows
colnames(res) <- c('source', 'target')

network <- graph_from_data_frame(d=res, directed=F) 

# plot it
plot(network)

griddf <- as.data.frame(matrix(cosSparse(t(A1)), nrow = 167, ncol = 167))
#griddf <- data.frame(matrix(cosSparse(t(A2)), nrow = 167, ncol = 167))
df2 <- data.frame(member = trading_members, cluster = A2clust$cluster)
d_clust <- unique(A2clust$cluster)

colnames(griddf) <- trading_members

griddf <- griddf %>% 
        mutate(Rep = trading_members) %>% 
        select(Rep, everything())

melted <- gather(griddf, key = 'Rep2', value = 'cossim', 2:168) %>% 
        filter(Rep != Rep2)


heatmap <- ggplot(melted, aes(reorder(Rep, -cossim), reorder(Rep2, -cossim), fill = cossim))+
        geom_tile()+
        scale_fill_viridis_c(option = 'D')+
        theme(axis.text = element_blank(),
              axis.title = element_blank())+
        labs(fill = 'Cosine\nSimilarity',
             title = 'Congressional Member Cosine Similarity Heatmap')

ggplotly(heatmap)
'%!in%' <- Negate('%in%')


res_mat <- matrix(data = NA, nrow = length(d_clust), ncol = 3)

for(k in d_clust){
        print(k)
        m_vec <- df2$member[df2$cluster == k]
        ind <- which(griddf$Rep %in% m_vec)
        col_ind <- which(colnames(griddf) %in% m_vec)
        
        df <- griddf[ind,c(1,col_ind)] %>% 
                gather(key = 'Rep2', value = 'cossim', 2:length(c(1, col_ind))) %>% 
                filter(Rep != Rep2)
        
        col_ind2 <- which(colnames(griddf) %!in% m_vec)
        
        non = griddf[ind, col_ind2] %>% 
                gather(key = 'Rep2', value = 'cossim', 2:length(col_ind2)) %>% 
                filter(Rep != Rep2)
        c_vec <- c(k, mean(df$cossim), mean(non$cossim))
        print(mean(df$cossim))
        print(mean(non$cossim))
        
        res_mat[k+1,] <- c_vec
}
## [1] 1
## [1] 0.07771319
## [1] 0.08723857
## [1] 0
## [1] 0.1469784
## [1] 0.08723857
colnames(res_mat) <- c('cluster', 'wics', 'wocs')
print(res_mat)
##      cluster       wics       wocs
## [1,]       0 0.14697843 0.08723857
## [2,]       1 0.07771319 0.08723857